Data-Intensive Applications on HPC Using Hadoop, Spark and RADICAL-Cybertools

Shantenu Jha and Andre Luckow

The tutorial material is available as IPython notebooks on GitHub:

Requirements and Setup:

For this tutorial we set up a Hadoop cluster and an IPython Notebook environment on Amazon Web Services (no longer active after the tutorial):

Below is a list of dependencies for installation on other machines:

  • IPython
  • NumPy
  • pandas
  • scikit-learn
  • Matplotlib, Seaborn

We recommend using the Anaconda Python distribution.

1. Hadoop and Spark Introduction

We begin with an overview of using Hadoop and Spark:

Hadoop MapReduce: Link to Notebook
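
As a first orientation, the sketch below shows the classic word-count example written for Hadoop Streaming, which lets MapReduce jobs be expressed as plain Python scripts that read from stdin and write to stdout. File names and paths are illustrative.

```python
#!/usr/bin/env python
# mapper.py -- word-count mapper for Hadoop Streaming:
# emit "<word>\t1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming sorts the mapper output by key,
# so counts per word can be accumulated in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

The job is then submitted with the Hadoop Streaming JAR, along the lines of `hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`; the JAR's location depends on the installation.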

Spark: Link to Notebook
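
For comparison, here is a minimal PySpark version of the same word count using the RDD API; the application name and input path are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # app name is arbitrary

# The same word count expressed as RDD transformations:
# split lines into words, pair each word with 1, sum per key.
counts = (sc.textFile("hdfs:///tutorial/input")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))   # inspect the first few (word, count) pairs

sc.stop()
```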

2. Pilot-Abstraction for Distributed HPC and the Apache Hadoop Big Data Stack (ABDS)

The Pilot-Abstraction has been used to execute task-based workloads on distributed resources. A Pilot-Job is a placeholder job that is submitted to the resource management system and serves as a container for a dynamically determined set of compute tasks. The Pilot-Data abstraction extends the Pilot-Abstraction to support the management of data in conjunction with compute tasks.

The Pilot-Abstraction supports heterogeneous infrastructures, including cloud, HPC, and Hadoop resources.

The following example demonstrates how the Pilot-Abstraction is used to manage a set of compute tasks.

Link to Notebook
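
For readers without access to the notebook, the sketch below illustrates the pattern with RADICAL-Pilot, one implementation of the Pilot-Abstraction: a pilot is submitted as a placeholder job, and compute units are then scheduled into it. The resource label, core count, and exact API surface vary across versions, so treat this as illustrative rather than definitive.

```python
import radical.pilot as rp

session = rp.Session()

# Submit a pilot -- a placeholder job that acquires resources
# on our behalf (here: 4 cores on the local machine for 10 min).
pmgr  = rp.PilotManager(session=session)
pdesc = rp.ComputePilotDescription()
pdesc.resource = "local.localhost"   # placeholder resource label
pdesc.cores    = 4
pdesc.runtime  = 10                  # minutes
pilot = pmgr.submit_pilots(pdesc)

# Schedule a dynamically determined set of compute tasks
# (compute units) into the pilot's resources.
umgr = rp.UnitManager(session=session)
umgr.add_pilots(pilot)

cuds = []
for i in range(8):
    cud = rp.ComputeUnitDescription()
    cud.executable = "/bin/echo"
    cud.arguments  = ["task %d" % i]
    cuds.append(cud)

umgr.submit_units(cuds)
umgr.wait_units()     # block until all tasks have completed

session.close()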

3. Advanced Analytics

The following pair plots show a scatter plot for each pair of the four features; clusters for the different species are indicated by color.

Link to Notebook
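
The description above matches the classic Iris dataset (four numeric features, three species), which ships with Seaborn; assuming that dataset, a pair plot of this kind can be reproduced in a few lines:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Iris: four numeric features plus a "species" label column.
iris = sns.load_dataset("iris")

# One scatter plot per pair of features; coloring the points by
# species makes the per-species clusters visible.
sns.pairplot(iris, hue="species")
plt.show()
```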

4. Future Work: Midas